-
-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace globby with much faster code when gitignore enabled #426
Conversation
Someone is attempting to deploy a commit to a Personal Account owned by @webpro on Vercel. @webpro first needs to authorize it. |
Hi @phiresky, this is fantastic! 🔥 Thank you so much for doing this. This is such a huge win, happy to replace globby and merge your PR. I don't have an issue with maintaining that here. Just a few points:
|
c579511
to
0120410
Compare
I've rebased this on master.
If that's only happening since this branch then it's probably related to some path joining doing something with \ and not /. Globby I think has some logic to convert \ to / so there's probably some inconsistency now.
It should fix access errors for files / directories that are in ignored directories yes. If there's an access error in a file that's part of the repo then no. (But that seems like it probably shouldn't be ignored anyways)
Note that this potentially can still decrease the performance against I don't think the impact should be much though because it's a single scan that (on my larger repo) only takes <1s. fast-glob optimizes patterns I think by e.g. parsing them into path segments and matching those faster but not sure. |
Thanks!
I have a Windows laptop right here, I'll look into it.
Yeah that's what I meant, at least users can ignore problematic dirs/files.
Thanks, I'll benchmark a bit. But the overall win is a no-brainer 🙏 On a related note, have you seen https://github.com/fabiospampinato/fast-ignore? If not, might have some good ideas as well. |
Easy fix up for Windows, to be continued.. Just for the record: diff --git a/packages/knip/src/util/globby.ts b/packages/knip/src/util/globby.ts
index 1fc3ef1e..2b9fb378 100644
--- a/packages/knip/src/util/globby.ts
+++ b/packages/knip/src/util/globby.ts
@@ -37,7 +37,7 @@ function convertGitignoreToMicromatch(pattern: string, base: string) {
function parseGitignoreFile(filePath: string, cwd: string) {
const file = fs.readFileSync(filePath, 'utf8');
- const base = path.relative(cwd, path.dirname(filePath));
+ const base = path.relative(cwd, path.dirname(path.toPosix(filePath)));
return file
.split(/\r?\n/)
|
When trying this on other repos, I've noticed some where indeed faster, but others were - surprisingly - slower than before. A couple of things I tried and seem to be working well:
I've benchmarked by wrapping e.g. WDYT? diff --git a/packages/knip/package.json b/packages/knip/package.json
index f6fc237d..d9b5dc16 100644
--- a/packages/knip/package.json
+++ b/packages/knip/package.json
@@ -54,7 +54,6 @@
],
"dependencies": {
"@ericcornelissen/bash-parser": "0.5.2",
- "@nodelib/fs.walk": "2.0.0",
"@npmcli/map-workspaces": "3.0.4",
"@npmcli/package-json": "5.0.0",
"@pkgjs/parseargs": "0.11.0",
@@ -71,6 +70,7 @@
"pretty-ms": "8.0.0",
"strip-json-comments": "5.0.1",
"summary": "2.1.0",
+ "tiny-readdir-glob": "1.2.1",
"zod": "3.22.4",
"zod-validation-error": "2.1.0"
},
diff --git a/packages/knip/src/util/globby.ts b/packages/knip/src/util/globby.ts
index 2b9fb378..8300d42c 100644
--- a/packages/knip/src/util/globby.ts
+++ b/packages/knip/src/util/globby.ts
@@ -1,13 +1,12 @@
import * as fs from 'fs';
-import { promisify } from 'node:util';
-import { walk as _walk } from '@nodelib/fs.walk';
import { type Options as FastGlobOptions } from 'fast-glob';
import fastGlob from 'fast-glob';
import micromatch from 'micromatch';
+import readdir from 'tiny-readdir-glob';
+import { GLOBAL_IGNORE_PATTERNS } from '../constants.js';
import { debugLogObject } from './debug.js';
import * as path from './path.js';
-const walk = promisify(_walk);
type Options = {
/** Respect ignore patterns in `.gitignore` files that apply to the globbed files. */
readonly gitignore?: boolean;
@@ -48,23 +47,19 @@ type Gitignores = { ignores: string[]; unignores: string[] };
/** walks a directory, parsing gitignores and using them directly on the way (early pruning) */
async function parseFindGitignores(options: Options): Promise<Gitignores> {
- const ignores: string[] = [];
+ const ignores: string[] = ['.git', ...GLOBAL_IGNORE_PATTERNS];
const unignores: string[] = [];
- const consideredFiles: string[] = [];
- await walk(options.cwd, {
- entryFilter: entry => {
- if (entry.dirent.isFile() && entry.name === '.gitignore') {
- consideredFiles.push(entry.path);
- for (const rule of parseGitignoreFile(entry.path, options.cwd))
- if (rule.negated) unignores.push(...rule.patterns);
- else ignores.push(...rule.patterns);
- return true;
- }
- return false;
- },
- deepFilter: entry => !micromatch.any(path.relative(options.cwd, entry.path), ignores, { ignore: unignores }),
+ const result = await readdir('**/.gitignore', {
+ cwd: options.cwd,
+ ignore: (entry: string) => micromatch.any(path.relative(options.cwd, entry), ignores, { ignore: unignores }),
});
- debugLogObject(options.cwd, 'gitignore files', { consideredFiles, ignores, unignores });
+ for (const filePath of result.files) {
+ for (const rule of parseGitignoreFile(filePath, options.cwd)) {
+ if (rule.negated) unignores.push(...rule.patterns);
+ else ignores.push(...rule.patterns);
+ }
+ }
+ debugLogObject(options.cwd, 'gitignore files', { consideredFiles: result.files, ignores, unignores });
return { ignores, unignores };
}
const cachedIgnores = new Map<string, Gitignores>(); |
I'm a bit surprised with your results . What are the absolute numbers you're looking at? Is it more total like 100ms or total 30s+? Could you also say how many directories total are in your test dir and how many are gitignored? That @nodelib/fs.walk library is what fast-glob also uses internally so I'm confused why it's not fast. Slower than before is very surprising because as I said the previous method scans through the directory many times, once for each plugin matched times for each package (afaik). Are they very simple repos? Maybe it's not the library but just the fact that your code isn't using the I can benchmark it tomorrow but just looking at your patch I don't think it's going to work well on my repo - the issue is that you're not building the ignore list incrementally while walking and thus pruning the walk tree directly - that's probably why you also need to have "node_modules" in your pre-created ignore list. With my method it shouldn't matter because the root gitignore is going to be added to the ignore list before any deeper directories are traversed. I know that pre-adding node_modules to all kinds of ignore lists is a neat hack that tons of JS projects use but imo it's a bit of a crappy workaround that indicates the code is unnecessarily inefficient. (try renaming the node_modules directory, moving it to a subdir and gitignoring it there). For example if you have tons of build artifacts outside of node_modules. There's a compromise where you parse the root .gitignore first and then walk the rest of the tree non-incrementally. Since most people seem to not even know about .gitignore files in non-root directories it's probably an ok heuristic but it's a bit ugly because then for some cases it's still going to be slow. |
You can also try the opposite, add the pruning entryFilter to your tiny-readdir-glob implementation,something lke this: ignore: (entry: string) => {
if (micromatch.any(...)) return true;
if (basename(entry)) === ".gitignore")
// do the parseGitignoreFile steps from above to parse the gitignore and add it directly to ginores
return false;
} |
Here are some numbers. "knip" means this repo at 92c1ac0035e3a63ab4776d224d3d31f96e8bcdc8, "kadena.js" (which the "boost doc" was inspired from) is https://github.com/kadena-community/kadena.js at 06315e20c0b750867965fbe00f0e9babce8bef1b. But that doesn't matter too much, what matters most is the number of invocations and the sums/total time. I've "timerified"
knip
kadena.js
knip
kadena.js
knip
kadena.js
|
Quick question, why is |
I've yanked bottom lines just to reduce noise |
@phiresky I'm very excited to get this merged and include it with the v4 release I'm preparing. Also, I'd really like to thank you for pushing this and attribute the whole pull request 100% to you (that's also why I've pasted the diffs here). Any chance you could still give it a shot on your workload? Are you in favor of adding my changes to your PR? |
Sorry, I got sidetracked by other stuff. Here's a comparison benchmark with your change above: running on my repomy branch, unmodified except for timerify(parseFindGitignores):
Total running time: 2m 7.1s (mem: 3549.74MB) your patch: (i'm sorry, I waited for 15 minutes and it's still in "Connecting the dots..." so i killed it. no results) Only after moving out a huge gitignored backup directory from the repo, these are the results:
Total running time: 5m 28.8s (I guess most of the time is spent in untracked functions) running on kadena.jsmy branch:
your patch:
So yeah, I can reproduce your patch being faster on kadena.js. But it's much slower on my repo with large amounts of gitignored directories. new versionIt's pretty easy to see that parseFindGitignores spends almost all of it's time in deepFilter. The only thing that does is call micromatch.any. Looking at that function in micromatch, it's pretty dumb. It constructs a picomatch matcher and then immediately calls it. If we instead construct that picomatch matcher ourself once per gitignore and use the same instance for every directory, these are the results on kadena.js:
That's still slightly slower but much less significant in the grand scheme of things. I guess micromatch is just a bad wrapper around picomatch. Now again here's an overview table of my results
(An additional improvement looks like it comes from updating the isGitIgnoredFn() function to use picomatch as well) Please try the updated branch and let me know your results. aab7788 |
Awesome, the latest version is faster in every instance, compared to both the Very glad we made it through, thanks again! I'm going to merge this and fix any remaining issues for Windows. |
Had to fix a few things for Windows and then some, but this is published in v4.0.0-canary.10 |
Bummer, I knew I forgot something. Linting the https://github.com/TanStack/query repository (already using Knip, with 73 workspaces) went up from 4.7 to 16.1 seconds. Going to dive into that further tomorrow. |
You could also give backstage repo a try from the root 👌 |
The regression I've mentioned in my last comment has been fixed (and actually improved by a good margin) in
Latest
|
The current dir walking method has two issues:
The reason for both is in the globby() calls.
On every globby(patterns) call, previously this happened:
globby is just a thin wrapper around fast-glob, and the only globby feature knip uses is the gitignore function, which has caused multiple issues in the past (That's the reason the docs recommend trying to turn off gitignore detection for performance afaik?).
With this PR, instead this happens on a globby() call:
The repo is walked using @nodelib/fs.walk, incrementally building the gitignore list. Files in gitignored directories are not walked (that's what fast-glob uses internally, but fast-glob doesn't support dynamic ignores (Feature request: dynamically ignore files or directories mrmlnc/fast-glob#265)).
The gitignore list is cached per passed cwd so they are only ever computed and parsed once.
Since I don't think there's a lib that actually supports collecting multiple gitignore I wrote a small gitignore->micromatch converter. This converter is probably not perfect, but the globby solution is also wrong (I can give details if needed).
fast-glob is called with the patterns and ignore list from knip itself combined with the ignore list from the gitignore files. This causes fast-glob to also do early pruning which makes walking here much faster as well.
This should fix #425 and inverse the recommendation to set
gitignore: false
. Instead, knip should now be faster when ignoring more files which makes sense.The reason I'm submitting this here and not to globby itself is that (a) globby is written in JS not TS and (b) has multiple other functions that would probably be affected but knip doesn't care about.